Goto

Collaborating Authors

 reliability diagram


A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering

arXiv.org Machine Learning

Calibration measures whether a model's predicted confidence aligns with its empirical accuracy, and is central to the reliable deployment of large language models (LLMs) in high-stakes domains such as medicine and law. While much recent work focuses on improving LLM calibration, the equally important question of how to evaluate it in realistic settings remains underdeveloped. Open-ended question answering (QA), the most common deployment setting for modern LLMs, is where existing evaluation methods fall short: logit-based metrics need restricted output formats and internal probabilities; verbalized confidence is self-reported and often overconfident; and sampling-based methods rely on task-specific extraction rules without a clear finite-sample target. We introduce Sem-ECE (Semantic-Sampling Expected Calibration Error), a calibration evaluation framework for open-ended QA that samples answers from the model, groups them into semantic classes, and uses the resulting frequencies as confidence. We study two estimators within this framework: Sem$_1$-ECE, the same-sample self-consistency score, and Sem$_2$-ECE, a held-out variant that separates answer selection from confidence evaluation. We prove both are asymptotically unbiased, and further show that they agree on easy questions but diverge on hard ones with Sem$_2$ achieving strictly smaller calibration error, so their gap also serves as a diagnostic for question difficulty. Experiments on three open-ended QA benchmarks across five leading commercial LLMs match our theoretical predictions and show that Sem-ECE outperforms verbalized confidence and existing sampling-based methods, while complementing logit-based evaluation when internal probabilities are unavailable.



Ensemble-Based Dirichlet Modeling for Predictive Uncertainty and Selective Classification

arXiv.org Machine Learning

Neural network classifiers trained with cross-entropy loss achieve strong predictive accuracy but lack the capability to provide inherent predictive uncertainty estimates, thus requiring external techniques to obtain these estimates. In addition, softmax scores for the true class can vary substantially across independent training runs, which limits the reliability of uncertainty-based decisions in downstream tasks. Evidential Deep Learning aims to address these limitations by producing uncertainty estimates in a single pass, but evidential training is highly sensitive to design choices including loss formulation, prior regularization, and activation functions. Therefore, this work introduces an alternative Dirichlet parameter estimation strategy by applying a method of moments estimator to ensembles of softmax outputs, with an optional maximum-likelihood refinement step. This ensemble-based construction decouples uncertainty estimation from the fragile evidential loss design while also mitigating the variability of single-run cross-entropy training, producing explicit Dirichlet predictive distributions. Across multiple datasets, we show that the improved stability and predictive uncertainty behavior of these ensemble-derived Dirichlet estimates translate into stronger performance in downstream uncertainty-guided applications such as prediction confidence scoring and selective classification.


sidedCalibrationTheorem

Neural Information Processing Systems

Theorem 2. Suppose that the predictive distribution Q has the sufficient ability to approximate the true unknown distribution P, given data is i.i.d. Lm(P,Q) = 0 if and only if P = Q when F is a unit ball in a universal RKHS [13]. Becausetheconfidencelevelp2 p1 is exactly equal to the proportion of samples {y1,,yn} covered by the two-sided prediction interval. B.1 Baselines MC-Dropout (MCD) [12]: A variant of standard dropout, named as Monte-Carlo Dropout. Heteroscedastic Neural Network (HNN) [17]: In this approach, similar to a heteroscedastic regression, the network has two outputs in the last layer, corresponding to the predicted mean and variance for each input xi.


Deep Linear Discriminant Analysis Revisited

arXiv.org Machine Learning

We show that for unconstrained Deep Linear Discriminant Analysis (LDA) classifiers, maximum-likelihood training admits pathological solutions in which class means drift together, covariances collapse, and the learned representation becomes almost non-discriminative. Conversely, cross-entropy training yields excellent accuracy but decouples the head from the underlying generative model, leading to highly inconsistent parameter estimates. To reconcile generative structure with discriminative performance, we introduce the \emph{Discriminative Negative Log-Likelihood} (DNLL) loss, which augments the LDA log-likelihood with a simple penalty on the mixture density. DNLL can be interpreted as standard LDA NLL plus a term that explicitly discourages regions where several classes are simultaneously likely. Deep LDA trained with DNLL produces clean, well-separated latent spaces, matches the test accuracy of softmax classifiers on synthetic data and standard image benchmarks, and yields substantially better calibrated predictive probabilities, restoring a coherent probabilistic interpretation to deep discriminant models.


Investigating the Multilingual Calibration Effects of Language Model Instruction-Tuning

arXiv.org Machine Learning

Ensuring that deep learning models are well-calibrated in terms of their predictive uncertainty is essential in maintaining their trustworthiness and reliability, yet despite increasing advances in foundation model research, the relationship between such large language models (LLMs) and their calibration remains an open area of research. In this work, we look at a critical gap in the calibration of LLMs within multilingual settings, in an attempt to better understand how the data scarcity can potentially lead to different calibration effects and how commonly used techniques can apply in these settings. Our analysis on two multilingual benchmarks, over 29 and 42 languages respectively, reveals that even in low-resource languages, model confidence can increase significantly after instruction-tuning on high-resource language SFT datasets. However, improvements in accuracy are marginal or non-existent, resulting in mis-calibration, highlighting a critical shortcoming of standard SFT for multilingual languages. Furthermore, we observe that the use of label smoothing to be a reasonable method alleviate this concern, again without any need for low-resource SFT data, maintaining better calibration across all languages. Overall, this highlights the importance of multilingual considerations for both training and tuning LLMs in order to improve their reliability and fairness in downstream use.


An exact multiple-time-step variational formulation for the committor and the transition rate

arXiv.org Artificial Intelligence

For a transition between two stable states, the committor is the probability that the dynamics leads to one stable state before the other. It can be estimated from trajectory data by minimizing an expression for the transition rate that depends on a lag time. We show that an existing such expression is minimized by the exact committor only when the lag time is a single time step, resulting in a biased estimate in practical applications. We introduce an alternative expression that is minimized by the exact committor at any lag time. The key idea is that, when trajectories enter the stable states, the times that they enter (stopping times) must be used for estimating the committor and transition rate instead of the lag time. Numerical tests on benchmark systems demonstrate that our committor and transition rate estimates are much less sensitive to the choice of lag time. We show how further accuracy for the transition rate can be achieved by combining results from two lag times. We also relate the transition rate expression to a variational approach for kinetic statistics based on the mean-squared residual and discuss further numerical considerations with the aid of a decomposition of the error into dynamic modes.


Uncertainty Calibration of Multi-Label Bird Sound Classifiers

arXiv.org Artificial Intelligence

Passive acoustic monitoring enables large-scale biodiversity assessment, but reliable classification of bioacoustic sounds requires not only high accuracy but also well-calibrated uncertainty estimates to ground decision-making. In bioacoustics, calibration is challenged by overlapping vocalisations, long-tailed species distributions, and distribution shifts between training and deployment data. The calibration of multi-label deep learning classifiers within the domain of bioacoustics has not yet been assessed. We systematically benchmark the calibration of four state-of-the-art multi-label bird sound classifiers on the BirdSet benchmark, evaluating both global, per-dataset and per-class calibration using threshold-free calibration metrics (ECE, MCS) alongside discrimination metrics (cmAP). Model calibration varies significantly across datasets and classes. While Perch v2 and ConvNeXt$_{BS}$ show better global calibration, results vary between datasets. Both models indicate consistent underconfidence, while AudioProtoPNet and BirdMAE are mostly overconfident. Surprisingly, calibration seems to be better for less frequent classes. Using simple post hoc calibration methods we demonstrate a straightforward way to improve calibration. A small labelled calibration set is sufficient to significantly improve calibration with Platt scaling, while global calibration parameters suffer from dataset variability. Our findings highlight the importance of evaluating and improving uncertainty calibration in bioacoustic classifiers.


A Evaluation Metrics A.1 Expected Calibration Error (ECE) Expected calibration error (ECE) [

Neural Information Processing Systems

The difference between acc and conf can be intuitively seemed as the deviation of the outputs to the diagonal in Figure 1. The higher the accuracy of predictions is, the lower BS is. The details of these datasets are summarized in Table 5. Our data are public and do not contain personally identifiable information and offensive content. The learning rate is 0.01 and the maximum number of iteration is 50.


Semantic-Aware Gaussian Process Calibration with Structured Layerwise Kernels for Deep Neural Networks

arXiv.org Artificial Intelligence

Calibrating the confidence of neural network classifiers is essential for quantifying the reliability of their predictions during inference. However, conventional Gaussian Process (GP) calibration methods often fail to capture the internal hierarchical structure of deep neural networks, limiting both interpretability and effectiveness for assessing predictive reliability. We propose a Semantic-Aware Layer-wise Gaussian Process (SAL-GP) framework that mirrors the layered architecture of the target neural network. Instead of applying a single global GP correction, SAL-GP employs a multi-layer GP model, where each layer's feature representation is mapped to a local calibration correction. These layerwise GPs are coupled through a structured multi-layer kernel, enabling joint marginalization across all layers. This design allows SAL-GP to capture both local semantic dependencies and global calibration coherence, while consistently propagating predictive uncertainty through the network. The resulting framework enhances interpretability aligned with the network architecture and enables principled evaluation of confidence consistency and uncertainty quantification in deep models.